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ABSTRACT 

We develop a method for improving the parallel scalability of 
the recently developed parallel selected inversion algorithm 
[Jacquelin, Lin and Yang 2014], named PSellnv, on mas¬ 
sively parallel distributed memory machines. In the PSellnv 
method, we compute selected elements of the inverse of a 
sparse matrix A that can be decomposed as A = LU, where 
L is lower triangular and U is upper triangular. Updat¬ 
ing these selected elements of A~^ requires restricted col¬ 
lective communications among a subset of processors within 
each column or row communication group created by a block 
cyclic distribution of L and U. We describe how this type 
of restricted collective communication can be implemented 
by using asynchronous point-to-point MPI communication 
functions combined with a binary tree based data propaga¬ 
tion scheme. Because multiple restricted collective commu¬ 
nications may take place at the same time in the parallel 
selected inversion algorithm, we need to use a heuristic to 
prevent processors participating in multiple collective com¬ 
munications from receiving too many messages. This heuris¬ 
tic allows us to reduce communication load imbalance and 
improve the overall scalability of the selected inversion al¬ 
gorithm. For instance, when 6,400 processors are used, we 
observe over 5x speedup for test matrices. It also mitigates 
the performance variability introduced by an inhomogeneous 
network topology. 
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1. INTRODUCTION 

Collective communication such as broadcast and reduction 
is an ubiquitous type of communication used in many paral¬ 
lel programs. When such communication is required among 
all processors that belong to a communication group la¬ 
beled by a communicator, one can use standard message 
passing interface (MPI) functions such as MPLBcast and 
MPLReduce.The MPI libraries available on most of high per¬ 
formance computers often provide highly efficient implemen¬ 
tations of these functions. These implementations typically 
make use of a tree-based algorithm that minimizes the total 
communication volume and the number of messages. 

However, in some applications, collective communication is 
required only among a subset of processors within a prede¬ 
fined communication group, and this subset of processors 
may change over time. One such application is the pole ex¬ 
pansion and selected inversion method that can be 

used to accelerate Kohn-Sham density functional theory 
based electronic structure calculations. Because the current 
MPI standard does not support collective communication 
among an arbitrary subset of processors, one must resort 
to other mechanisms to accomplish such a communication 
task. 

One possible solution is to determine all collective communi¬ 
cation calls that will be needed in advance and the processors 
involved in each one of these calls, set up multiple commu¬ 
nication groups, and use them whenever they are needed. 
However, the total number of communication groups needed 
(e.g., in the selected inversion algorithm) may exceed the 
capacity of the MPI libraries, which is typically around sev¬ 
eral thousands (currently 4,096 on Cray MPI for instance). 
Hence the approach of pre-allocating all communicators is 
not feasible for all applications. 


Another approach is to create communication groups dy¬ 
namically as they are needed, and release them when they 
are no longer needed. However, this approach typically in¬ 
curs a signihcant amount of overhead that interferes with 
the asynchronous nature of the parallel selected inversion 
algorithm, and thus limits its parallel scalability on large 
scale distributed memory machines. 

Yet another solution is to replace the collective communi¬ 
cation altogether with point-to-point communications. Al¬ 
though this approach is plausible when each collective com¬ 
munication involves only a few processors distributed among 
a small network of processors, it quickly becomes inefficient 
when the number of processors involved in the communi¬ 
cation becomes large. One main pitfall of this approach is 
that communication load is not well balanced among dif¬ 
ferent processors. Such imbalance can severely impair the 
overall parallel performance. Furthermore, when executed 
on massively parallel machines that have a hierarchical and 
inhomogeneous network architecture, such an approach also 
introduces performance variability. 

However, improvement can be made to reduce the overall 
communication cost if we orchestra the point-to-point com¬ 
munication in such a way that mimics the collective commu¬ 
nication implemented in standard MPI libraries. That is, if 
we combine a tree-based algorithm with asynchronous point- 
to-point sends and receives, we can effectively construct dy¬ 
namic collective communications among an arbitrary subset 
of processors. 

In this paper, we demonstrate that the use of the third op¬ 
tion can be quite effective. However, it needs to be imple¬ 
mented with care to accommodate a special feature of the 
selected inversion algorithm that allows several restricted 
collective communications to take place at the same time. 
We present a heuristic that prevent processors participat¬ 
ing in multiple collective communications from receiving too 
many messages. We show that this heuristic is very effective 
in reducing the amount of load imbalance. It also reduces 
performance variability induced by an inhomogeneous net¬ 
work architecture. 

Our paper is organized as follows. In the next section, we 
briefly describe the selected inversion algorithm and its par¬ 
allel implementation. We point out the nature of collec¬ 
tive communications required in the parallel implementa¬ 
tion that calls for the implementation of customized broad¬ 
cast and reduction operations built on top of asynchronous 
point-to-point communications. We discuss the construc¬ 
tion of binary trees to propagate data among different pro¬ 
cessors and the heuristic for improving communication load 
balance. In section]^ we report the performance improve¬ 
ment achieved by this technique. In particular, we show 
that the implementation of dynamic collective communica¬ 
tion allows the selected inversion method to scale efficiently 
beyond 4000 processors. The wall clock time can be reduced 
by more than a factor of three, and the overall variation in 
runtime is also reduced. This improvement enables us to 
use selected inversion based electronic structure calculation 
to over 100,000 cores with the pole expansion and selected 
inversion technique . 


2. SELECTED INVERSION 

Let A G ^ non-singular sparse matrix. We use 

Aij to denote the (i,j)-th entry of the matrix A, and A;,* 
and to denote the i-th row and the j-th column of 
A, respectively. We are interested in computing selected 
elements of A~^, dehned as 

{(A~^)ij\ for such that Aij 0}. (1) 

Sometimes, we only need to compute a subset of these se¬ 
lected elements, for example, the diagonal elements of A~^. 
The most straightforward way to obtain these selected ele¬ 
ments of A~^ is to compute the full inverse of A and then 
extract the selected elements. But this is often prohibitively 
expensive in practice. If a sparse LU factorization of A is 
available (or LDL^ factorization if A is symmetric) , a more 
efficient way to achieve this goal is to use an algorithm that 
makes efficient use of the sparse L and U factors of A. In 
such an algorithm, which we call selected inversion (Sellnv) 
some additional elements of A~^ may need to be computed. 
However, the overall set of nonzero elements that need to be 
computed often remains a small percentage of all elements 
of A~^ due to the sparsity structure of A. 

The selected inversion algorithm and its variants have been 
discussed in a number of publications |10[ |11[ |12| |13[ 

|14| |15| |16[ 1^ 1^ . We review the basic ingredients of this 

algorithm in section [271] and describe the recently developed 
parallel algorithm in section [27^ 


2.1 Sequential algorithm 

The selected inversion algorithm can be derived as follows. 
Given a 2-by-2 block partitioning of matrix A of the form 


A = 


fAi,i 

V^2,l 


A2,2 J ’ 


( 2 ) 


where Ai^i is a scalar entry of A. Ai^i can be expressed as a 
product of two scalars Z/i,i and t/i,i. In particular, we can 
pick Li,i = 1 and f/i,i = Aiq. Then 


/Li,i 0\ /C/i,i UiA 

[L 2 ,i iJ [ 0 82 , 2 ) 


(3) 


where 


L 2,1 = A2,i(?7i,i) 


Hi,2 = (Li,i) ^Ai, 2. 


(4) 


The L and U factors are usually directly accessible in a stan¬ 
dard LU factorization, and 


S' 2,2 = A 2,2 — L 2 ,iHi ,2 (5) 


is the Schur complement. Using the decomposition given by 
Eq. ([^, we can express A~^ as 


A-^ 


-S^2L2,l(il,l)“^ 


-(t/l,l)-Wl.252') 

^ 2:2 


( 6 ) 


Since 82,2 is the same as 8 here, without ambiguity 822 = 
{ 8 ~^) 2,2 can be used. To simplify the notation, we define 
the normalized LU factors as 


Li,i — Li,i, Hi,i — Hi,i, 

L 2,1 = L2,i(Li,i)“\ Hi.2 = (Hi,i)-1C/i,2, 


( 7 ) 


[SlIISS [5] BEEaO 


and Eq. fef can be equivalently given by 

1 ^ + 17i.2S2jL2,i -U^,2Sll\ 

\ -SllL 2 ,i ) ■ 

( 8 ) 

Let us denote by C the set of indices 

{i|(L2.i),/0}U{i|(C/i.2)^^/0}, (9) 

and assume 5^2 has already been computed. From Eq. ||^ it 
can be readily observed that in order to compute the selected 
elements of = — ^•S'^ 2 ^ 2 ,ij for i £ C, we only need 

the entries 

{(5'2:2^)^ .|iec,jec}. (10) 

The same set of entries of S 22 are required to compute se¬ 
lected entries of A ^2 = ~t^i, 2 <S'^ 2 ' hlo additional entries of 
S 22 are needed to complete the computation of A^\, which 
involves the matrix product of selected entries of [/i ,2 and 
A^i- This procedure can be repeated recursively to com¬ 
pute selected elements of S 22 until 82,2 is a scalar of size 1. 
A pseudo-code for demonstrating this column-based selected 
inversion algorithm for symmetric matrix is given in [^. 


Algorithm 1: Selected inversion algorithm based on LU 
factorization. 


1 


2 


3 

4 

5 


(1) The supernode partition of columns of A: 

Input- {!> 2,..., A/”} 

^ ’ (2) A supernodal LU factorization of A with (un¬ 

normalized) LU factors L and U. 

Output: Selected elements of A~^ , i.e. AJ ^ such that 
Li,j is not an empty block, 
for K. = JV,Af — 1,..., 1 do 

Find the collection of indices 

C = {I \ X > 1C, Lxx is a nonzero block} U {^J \ J > 
K,, Uk.,j is a nonzero block} 

Le,K ■£- Lc,k{Lk.,k)~^,Uk,c ■£- {Ukx)~^Uk.,c 

end 

for tC = Af,Af — 1,..., 1 do 

Find the collection of indices 

C = {X \ X > K,, Lxx is a nonzero block} U {J \ J > 
YC, Uk.,j is a nonzero block} 

Calculate A^ ^ (^Lcx 

Calculate A^,- ^ 

Calculate A^ ^ ^— — U }c,c Aq ^ 

end 


In practice, a column-based sparse factorization and selected 
inversion algorithm may not be efficient due to the lack of 
level 3 BLAS operations. For a sparse matrix A, the columns 
of A and the L factor can be partitioned into supernodes. 
A supernode is a maximal set of contiguous columns X = 
{j) i + li ■ • ■ li + s} of the L factor that have the same nonzero 
structure below the {j -I- s)-th row, and the lower triangu¬ 
lar part of Ljx is dense. This definition can be relaxed to 
limit the maximal number of columns in a supernode (i.e. 
sets are not necessarily maximal). With slight abuse of no¬ 
tation, both a supernode index and the set of column indices 
associated with a supernode are denoted by uppercase script 
letters such as X,X,1C etc.. Ax.* and A*x are used to de¬ 
note the X-th block row and the X-th. block column of A, 
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ample of a 2D 
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(b) 2D block cyclic distribu¬ 
tion of PSellnv sparse ma¬ 
trix data structure on a 4- 
by-3 processor grid 


P, 

P, 









p. 

A 



11"- 1 









4 









I\2 


Pu 







A 


V 

A 




A 







3 


Ps 


J1 








I'- 








1 

n. 



T[ 





Pi 



A 

A 







ft 



ft 



Figure 1: Data layout of the internal sparse matrix data 
structure used by PSellnv. 


respectively. Aj ^ denotes the {X, ^)-th block of the matrix 
A“^, i.e. Aj^ = {A~^)x,j. When the block Ax.j itself is 
invertible, its inverse is denoted by {Ax,j)~^ to distinguish 
from Aj J. 

Using the supernode notation, a pseudo-code for the selected 
inversion algorithm is given in Algorithm 


2.2 Parallel selected inversion algorithm 

In this section we briefly discuss the parallel implementa¬ 
tion of the selected inversion algorithm, called PSellnv, on 
distributed memory parallel machines. More details of the 
implementation for symmetric matrices can be found in . 


PSellnv uses the same 2D block cyclic distribution scheme 
employed by SuperLU_DIST to partition and distribute 
both the L factor and the selected elements of A~^ to be 
computed. Before the factorization, columns of A, L and U 
are partitioned into supernodes of various sizes. This par¬ 
tition is applied to the rows of the input matrix to create 
a 2D block partition. These blocks are cyclically mapped 
onto processors that are arranged in a virtual Pr-by-Pc 2D 
grid. The mapping itself does not take the sparsity of the 
matrix into account, however only non-zero elements are ac¬ 
tually stored. As an example, a 4-by-3 grid of processors is 
depicted in Figure [I(a)[ The mapping of the 2D supernode 
partition of the matrix on the 2D processor grid is depicted 
in Figure 1(b) Each supernodal block column of L is dis¬ 
tributed among processors that belong to a column of the 
processor grid. Each processor may own multiple matrix 
blocks. For instance, nonzero rows in the second supernode 
are owned by processors P 2 and P5. 


We execute the first loop of Algorithmin a separate pass, 
since the data communication required in this step is rel¬ 
atively simple. The processor that owns the block Ljc.k 
broadcasts it to all processors that own nonzero blocks Lxx 
in the supernode K, within the same processor column. Each 
of those processors performs the triangular solve Lxx = 
Lxx{^k.x)~^ foi' each nonzero block contained in the set C 
defined in step of the algorithm. Because Lxx uot used 
in the subsequent steps of selected inversion, it is overwrit¬ 
ten by Lxx- Since communication is limited to a processor 





































Figure 2: Task parallelism and communication pattern for 
the supernode [^. There are 6 steps: @ broadcast L, (T) 
compute A~^L, reduce A~^L, @ compute L^A~^L, 
reduce VA~^L and @ update A~^. 


column group only, multiple supernodes can be processed at 
the same time. 

A more complicated communication pattern, which also turns 
out to be the most time consuming step in terms of data 
communication, is required to complete step in parallel. 
A^^ and Lex generally owned by different processor 
sets. In PSellnv, we choose to send the Lex matrix blocks 
to processors owning the matching blocks of to per¬ 
form the matrix-matrix multiplication, i.e. a particular ma¬ 
trix block Lix needs to be communicated to all processors 
within the same column group of processors among which 
A^\ is distributed. 

However, since the processor owning Lxx is generally not 
in the same processor row/column group that owns 
the communication cannot be performed by using a simple 
broadcast procedure. We briefly describe the communica¬ 
tion pattern for this step here, since most of the communi¬ 
cation cost is spent on this step. In PSellnv, we use point- 
to-point MPI sends that originate from the processor owning 
Lxx lo the group of processors holding A^^. 

For symmetric matrices, as soon as Lxx becomes available, 
as illustrated above, it is sent to the processor owning Uic,Xj 
which is then overwritten by Lx tc- Once Lxx has been sent 
the processor mapped to the upper triangular part of the ma¬ 
trix, stepj^of Algorithm^can be performed. Uk,i = Lxx 
first sent to all processors within the same processor column 
that owns Uic,x- The matrix-matrix product Aj^j-Lxx is 
then performed locally on each processor owning Aj^j- using 
BLASS kernel. Then, contributions Aj^j-Lxx are reduced 
within each processor rows owning Ljx- The block 

in step of Algorithm is now fully computed. 

Figure [^illustrates how this step is completed for supernode 
K. = [^. We use circled letters (^, @ to label communi¬ 
cation events, and circled numbers (T), @ to label com¬ 


putational events. Ue,8 = L^ g is sent by Pg to all processors 
within the processor column. This group include both Pg 
and Pii. Similarly Lio.e is broadcast from P 4 to all other 
processors within the processor column. Local GEMMs are 
then performed on Pn, Pio, P4 and P5 simultaneously, be¬ 
fore being reduced onto P12 and P5 within their respective 
processor row. After this step, A^g and A^gg become avail¬ 
able on P12 and Pg respectively. 


In 17 , we pointed out that an additional coarse-grained 
level of parallelism exists in the second loop of Algorithm 
Different loop iterates can be executed simultaneously if 1) 
there is no data dependency among these iterates; 2 ) there 
is no overlap among the processors that own data blocks 
belonging to these loop iterates. 


The absence of data dependency among different loop iter¬ 
ates results from the sparsity structure of A and its L and 


U factors, and can be exposed by the elimination tree 19 


associated with a sparse LU factorization. Although two 
supernodes belonging to two different branches of the elimi¬ 
nation tree would need to communicate with their common 
ancestors in the selected inversion algorithm, these commu¬ 
nications do not have to take place at the same time. Thus, 
it is still possible to process these two supernodes simulta¬ 
neously. However, due to the 2D cyclic distribution of the 
supernodes, it is possible that some of the matrix blocks be¬ 
longing to two independent supernodes are owned by the 
same processor. In that case, full parallelism cannot be 
achieved between the two supernodes. 


In PSellnv, we do not explicitly use the MPLBarrier func¬ 
tion for synchronization. The synchronization is only im¬ 
posed through data dependencies. As a result, tasks as¬ 
sociated with different supernodes can be executed concur¬ 
rently if these supernodes are on different critical paths of 
the elimination tree, and if there is no overlap among pro¬ 
cessors mapped to these critical paths. In this sense, the 
asynchronous task formulation tries to achieve two goals: 
pipelining computations and overlapping communication 
with computations. 


3. COLLECTIVE COMMUNICATION IN 
PARALLEL SELECTED INVERSION 


We can clearly see that the parallelization strategy we pre¬ 
sented in section [T 2 ] involves a fair amount of data commu¬ 
nication. Most of this communication occurs in the second 
loop of Algorithm and in particular, step [^ of the algo¬ 
rithm. Therefore, the performance of our parallel implemen¬ 
tation of the selected inversion algorithm depends critically 
on how Lxx i® matching blocks of A^j, and how 

local products A'^^j-Lxx ^'f® reduced to the processor that 


owns the Ljx block in supernode K.. These data commu¬ 
nications are collective in nature. However, they should 
be carefully treated in the sense that each broadcast op¬ 
eration labeled by (?) (or reduction operation labeled by 
(^) in Figure involves only a subset of processors within 
a column or row processor group defined within a virtual 
2D processor grid shown in Figure 1(b) We will call this 
type of collective communication restricted collective com¬ 
munication. Furthermore, in PSellnv, the subset of proces¬ 
sors involved in restricted collective communications varies 
































when different supernodes are processed due to the general 
sparsity structure of L and U. 

As we indicated in the introduction, one way to implement 
these collective communication is to construct all possible 
communication groups, each containing a subset of proces¬ 
sors involved in each broadcast and reduction operation 
respectively, in advance, and use MPI_Bcast and MPI_Reduce 
functions available in standard MPI libraries to perform 
these collective communications. However, for large prob¬ 
lems, the number of communication groups required often 
exceeds what most MPI libraries can provide. Besides the 
overhead for creating a large number of MPI communicators 
is non-negligible. Thus, this approach is not viable. 

Although it is possible to create these communication groups 
dynamically, frequent creation and release of communication 
groups tends to result in an excessive amount of overhead. 
Even if this overhead is negligible, using MPI_Bcast and 
MPI_Reduce is still not optimal because the collective and 
blocking nature of these functions reduces the opportunity 
to exploit the loop level concurrency available among differ¬ 
ent supernodes. The subset of ranks involved in one broad¬ 
cast may be different from the subset involved in another 
broadcast, but it is highly likely that the at least some of 
the ranks in one broadcast will also be in another broadcast. 
Consequently, the broadcast of one block cannot proceed un¬ 
til the previous broadcast completes, making the pipelining 
of updates of different supernodes in an asynchronous fash¬ 
ion more difficult to achieve. Ideally, we would like to have a 
set of light-weight asynchronous broadcast and reduction 
functions that can be dynamically created with very little 
overhead. 

We also note that the group of processors involved in each 
collective communication is determined once the L, U factors 
and the 2D processor mapping is given, and therefore no fur¬ 
ther communication is needed to set up the tree once the list 
of processors is known. With such a list, the tree structure 
can be created dynamically with very small overhead. The 
buffer arrays for performing the selected inversion is also 
created dynamically. Such functions are currently not avail¬ 
able in standard MPI libraries. Therefore, we decided to 
implement this type of restricted collective communication 
through the use of point-to-point asynchronous communica¬ 
tion functions such as MPI_Isend and MPI_Irecv. 

In the implementation we presented in [17| , we simply is¬ 
sue multiple MPI_Isend’s to send Li,)c from one processor 
to other processors that own different matching blocks of 
Similarly, multiple MPI_Irecv’s are issued to accu¬ 
mulate the products Aj^j-L j^k, on the processor that owns 
Aj^j^. By using this simple strategy, we were able to per¬ 
form parallel selected inversion efficiently on 256 ~ 1,024 
processors depending on the sparsity pattern and the size of 
a matrix. However, when a larger number of processors are 
used, the performance of parallel selected inversion quickly 
deteriorates. The performance profile we measured indi¬ 
cated that communication cost became the dominant cost 
in those cases. For instance, for the DG_PNF14000 matrix 
in section when a relatively small number of processors 
P = 256 is used, the communication cost is 27%, and the 


time spend on the computation, mainly the matrix-matrix 
multiplication (GEMM) routine is 73%. When a large num¬ 
ber of processors P = 4,096 is used, the communication cost 
is 89%, and the time spend on the computation, mainly the 
matrix-matrix multiplication (GEMM) routine is only 11%. 

A closer look at the communication profile reveals that the 
increase of communication cost is partly caused by a large 
variation in communication volumes consumed by different 
processors. 


The communication imbalance is exacerbated by the in¬ 
homogeneity of the communication bandwidth and latency 
among different nodes and processors. Figure 8(a) shows 
that as the number of processors increases, the amount of 
run time variation also increases when we ran the same 
executable and input matrices multiple times. Since PSellnv 
is a deterministic algorithm, the run time variation is likely 
caused by variation in the communication bandwidth and la¬ 
tency among different combination of nodes and processors 
and the actual mapping between the 2D virtual processor 
grid and the physical layout of the processors. 


Any network will have different levels of locality, and the re¬ 
alities of packaging dictate that the distance between compu¬ 
tational nodes will vary and that subsets of nodes will share 
routers while other nodes will not. In most MPI implemen¬ 
tations, ranks are assigned so that consecutive ranks first fill 
up a node, and then fill the closest node physically, and so 
on. It is thus very likely that jobs placed on machines ranks 
that are logically close in MPI_CDMM_WORLD are also physically 
close to each other. Therefore the goal of our broadcast im¬ 
plementation should be to minimize the amount of data that 
needs to be transferred at long distance, both logically and 
physically, while at the same time avoiding hot spots in the 
network. 


To reduce communication imbalance and consequently the 
communication cost, we modify the way the collective com¬ 
munication in step 3 of Algorithm is implemented. In¬ 
stead of using a “centralized” sender/receiver model for the 
broadcast and reduction in which the communication path 
can be described a Flat-Tree shown in Figure [3(a) 


a binary-tree based algorithm commonly employed in the 
implementation of MPI_Bcast and MPI_Reduce. Gompared 
to the Flat-Tree model, which puts a heavy load on the 
root, the Binary-Tree based scheme reduces the total vol¬ 
ume sent/received from the root from p — 1 messages to 
two messages, and spreads the total communication volume 
among more processors. Figure |3(a)| and |3(b)^ hows how 
messages are passed among different processors in a Flat- 
Tree and Binary-Tree based broadcast operation. 


In order to implement a non-blocking Binary-Tree based 
collective communication scheme, destination processor of 
each message has to be specified in a hierarchical fashion. 
We will refer to processors that lie between the root and the 
leaves of the tree as “internal nodes”. Such processors serve 
as the forwarding processors. The Binary-Tree is built by 
repeatedly splitting the ordered list of ranks in two parti¬ 
tions, and chose the first rank in each half to be the internal 
nodes at the current level. 










The Binary-Tree has two main benefits. First, the large 
reduction in the number of messages sent from the root 
greatly reduces the chances of an instantaneous hot spot 
in the network around the root node. Second, the Binary- 
Tree greatly increases the chance that data is exchanged 
between two ranks that are logically closer, and thus likely 
physically closer in the network, by putting them in the 
same partition. As an example, Figure [3(b)| shows that the 
processors P\ — Pe are involved in a broadcast operation, 
with P 4 being the root. The Flat-Tree simply sends data 
from Pi to all processors other than P4. The Binary-Tree 
uses a pre-designed ordering, i.e. Pi first sends to Pi and 
P 5 which are of distance 1 from the root Pi. The data is 
further broadcast from Pi to P 2 , f’s, and from P 5 to Pg. The 
data communication can be performed recursively for deeper 
trees. 

The drawback of the Binary-Tree is that due to the pipelin¬ 
ing of different loop iterates in the outer-loop of Algorithm]^ 
each processor may participate in several non-blocking re¬ 
stricted collective communication simultaneously. If that 
processor is an internal node in several binary communi¬ 
cation trees, the total volume from those many broadcasts 
passing through that processor can be much larger than that 
sent by other processors. One can see that with this scheme, 
the highest numbered rank in a column will never be chosen 
as an internal node and thus will never forward any data. 
On the other hand, the lowest numbered rank in a column 
will always be chosen as an internal node and thus will al¬ 
ways forward data. While the exact ranks chosen as internal 
nodes will vary depending on the root, patterns of communi¬ 
cation intensity will develop throughout the range of ranks. 
Such a striped pattern is clearly seen in the communication 
volume heat map seen in Figure 


To alleviate this problem, we use a heuristic method that in¬ 
volves applying a random circular shift to the list of receiving 
processor ranks. Such a procedure, referred to as Shifted 
Binary-Tree, is depicted in Figure [3(c)| A random position 
is selected in the sorted list of ranks and a circular shift is 
then executed around this position. The random shift makes 
it therefore less likely that same processors will be picked as 
internal nodes when building multiple binary trees. The ra¬ 
tionale of this circular shift is therefore to smooth the total 
communication load across all processors. 

In our example in Figure |3(c)[ the Shifted Binary-Tree 
breaks the pre-designed and monotonically increasing or¬ 
dering of ranks involved in the tree, and picks a random 
processor other than the root (P4) to be the first child. The 
rest of the processors follow circularly, so that the sequence 
Pi,P 6 ,Pi,P 2 ,P 3 ,P 5 can be regarded as a re-ordered list to 
generate the Binary-Tree. This results in a different data 
communication pattern. 

One can also consider using a fully random permutation of 
processor ranks. However, such a permutation would re¬ 
duce network locality by putting ranks which are logically 
“closer” far from each other. Moreover, our experiments 
show that this approach leads to deteriorated load balanc¬ 
ing in terms of communication volume compared to Shifted 
Binary-Tree. 



(a) Flat-Tree (b) Binary-Tree Shifted Binary-Tree 


Figure 3: Various possible tree-based communication pat¬ 
terns for the broadcast operation. 


Both these binary tree structures do an excellent job of re¬ 
ducing “hot spots” in the network and reducing the commu¬ 
nication distance for data transfer. As already discussed, the 
simple Binary-Tree structure reduces the number of mes¬ 
sages transferred to/from the root from p — 1 to just two, 
greatly reducing the chance of an instantaneous hot spot 
developing at the time of that broadcast. It also increases 
the chances that data will be transferred between ranks that 
are logically close to each other, rather than all data being 
transferred from the root to the leaf no matter the distance 
between the leaf and the root. 

The Shifted Binary-Tree further reduces the chances of 
hot spots in the network given that there are many broad¬ 
casts happening simultaneously. The communication heat 
map in Figure [5(c)| clearly shows that the communication 
volume spreads much more evenly across all ranks. The 
maximum amount of data sent by any rank is also reduced. 
Note that the circular shift potentially reduces network lo¬ 
cality by putting the highest rank in the list before the lowest 
rank. However, this does not negatively impact the algo¬ 
rithm in any significant way, since the root and the next 
level of internal nodes were not guaranteed to be close to 
each other when the number of processors involved in a com¬ 
munication is relatively large. 

4. NUMERICAL RESULTS 

We now report the outcome of a number of computational 
experiments conducted to analyze the communication vol¬ 
ume and pattern of PSellnv, and to evaluate and compare 
the efficiency of different ways to implement the restricted 
collective communication pattern required in PSellnv. 

In all of our experiments, we used the NERSC Edison plat¬ 
form with Cray XC30 nodes. Each node has 24 cores parti¬ 
tioned among two Intel Ivy Bridge processors. Each 12-core 
processor runs at 2.4GHz. A single node has 64GB of mem¬ 
ory, providing more than 2.6 GB of memory per core. 

To evaluate the performance of PSellnv, we use two matri¬ 
ces of different sizes and sparsity patterns. The 
DG_PNF14000 matrix is generated from the electronic struc¬ 
ture calculation of a 2D phosphorene nanoflake with 14, 000 
atoms. The matrix is a discretized Kohn-Sham Hamilto¬ 
nian obtained from an adaptive local basis expansion scheme 
combined with a discontinuous Galerkin framework . This 
matrix is relatively dense. The matrix size is 512,000, with 
0.2% nonzeros in A and 1.3% nonzeros in the L and U fac- 
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Figure 4: Communication volume distribution of Col-Bcast 


tors. The second matrix is named audikw_l matrix obtained 
from the University of Florida matrix collection [^. This 
matrix is relatively sparse. The matrix size is 943,695, with 
0.009% nonzeros in A and 0.3% nonzeros in the L and [/ 
factors. These two matrices represent two different sce¬ 
narios in terms of total communication volume. For the 
DG_PNF14000 matrix, the communication volume of is ex¬ 
pected to be very large. This can lead to imbalanced data 
communication on different processors. For the audikw_l 
matrix, the scalability of PSellnv is more limited by a larger 
communication over computation ratio. 


4.1 Communication load analysis 

The first set of experiments aims at analyzing the communi¬ 
cation load among different processors, and comparing the 
efficiency of using different types of tree structures to im¬ 
plement restricted collective communications by using asyn¬ 
chronous MPf functions. We report the total data volume 
sent from each processor on a 64-by-64 = 4,096 processor 
grid for the audikw_l matrix. 

Our main focus is the broadcast of blocks of Lx,k within 
each column group, and the reduction of A~^.j^Lx,k within 
each row group. These two operations are the most expen¬ 
sive communication steps of PSellnv, and we refer to them 
as Col-Bcast and Row-Reduce respectively. We report the 
minimum and maximum outgoing volume of data among all 
processors in Table for different types of tree-based collec¬ 
tive communication schemes. 


In Figure 5(a)[ we report the volume sent during Col-Bcast 
using the Flat-Tree pattern, as used in the PSellnv de¬ 
veloped in (currently released under the PEXSI pack¬ 
age vO.7.3, referred to as PSellnv vO.7.3 below). We ob¬ 
serve that the data volume associated with processors near 
the diagonal of the 2D processor grid is significantly higher 
than those associated with off-diagonal processors. Signif¬ 
icant variation of communication volume can also be seen 
among the diagonal processors themselves. Furthermore, 
from the distribution of the processor load shown in Fig¬ 
ure |4(a)[ we observe that some processors send more than 
twice the average volume of data sent by all processors. This 
load imbalance creates contention on the network and limits 
the strong scalability of PSellnv. 


When a simple Binary-Tree is used to organize the way mes¬ 


Communication tree 

Min 

Max 

Median 

Std. dev 

Flat-Tree 
Binary-Tree 
Shifted Binary-Tree 

156.951 

5.89874 

238.647 

600.168 

708.268 

363.336 

291.595 

288.851 

298.58 

72.048 

226.565 

24.0957 


Table 1: Volume sent during Col-Bcast (in MB) for the 
audikw_l matrix. 


sages are sent from the root to other processors involved in 
the restricted collective communication, load imbalance can 
still be observed from the heat map given in Figure [5(b)[ It 
can be seen from Table[^and Figure 4(b) that the maximum 
communication volume among all processors and the stan¬ 
dard deviation are actually higher than those in the Flat- 
Tree based communication. Although most of the nodes see 
their load decreased and the median communication volume 
reduced from 291 MB to 288 MB, ranks in the last quartile 
of the most loaded processors send more than 536 MB of 
data instead of the 331 MB of data sent using a Flat-Tree 
based scheme. This observation confirms that some proces¬ 
sors participate in multiple Binary-Tree based broadcasts 
as internal nodes. 


In order to reduce the likelihood of a processor being chosen 
repeatedly as an internal node, we introduced the Shifted 
Binary-Tree communication scheme in Section]^ The com¬ 
munication volume heat map associated with this scheme is 
shown in Figur e |5(c)| We use the same colorbar that we 
used for Figure 5(a) so that we can compare these two heat 
maps directly. We can clearly see that the overall heat map 
is much “cooler”. The communication “hot spots” appear 
to be eliminated by the Shifted Binary-Tree. As we ex¬ 
plained in section]^ the reduction in the overall communi¬ 
cation load and the removal of the “hot spots” result from 
shifting processor ranks in such a way that different pro¬ 
cessors are picked as internal nodes of different communica¬ 
tion trees. The effect of this is clearly observed in practice 
on the minimum and maximum volumes, given in Table 
The variation of the communication volume among differ¬ 
ent processors is significantly reduced (resp. MIN 238 MB 
and MAX 363 MB) than that using Flat-Tree (resp. MIN 
156 MB and MAX 600 MB). The standard deviation is sig¬ 
nificantly reduced from 72 MB to 24 MB, confirming the 
efficiency of the approach. 


When fewer processors are used to perform PSellnv, the 
communication load imbalance might not be so severe. Fig- 
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Figure 5: Communication volume heat map of Col-Bcast. 



Figure 6: Communication volume heat map of Col-Bcast 
using Flat-Tree on 256 processors 


processors. 

Altogether, our experimental results demonstrate that the 
use of a binary tree to organize messages in a restricted col¬ 
lective communication does mitigates the inherent load im¬ 
balance of the Flat-Tree communication pattern. However, 
since multiple restricted collective communications may take 
place at the same time with some of the processors partici¬ 
pating in all of them, the binary trees associated with these 
collective calls have to be built in such a way that these pro¬ 
cessors are not always picked as internal nodes of the binary 
trees. This can be achieved using the proposed Shifted 
Binary-Tree communication pattern. 


ure[^ depicts the communication volume heat map of Col- 
Bcast for the same audikw_l matrix running on a 16-by-16 
processor grid using the Flat-Tree scheme. In this case, 
the average volume is 1185.77 MB, while the standard devi¬ 
ation is 96.02 MB, corresponding to 8% of the average. This 
is sharply lower than the 23.7% standard deviation when 
PSellnv is carried out on 4,096 processors. 



(a) Flat-Tree 


(b) Shifted Binary-Tree 


4.2 Impact on performance 

In this section, we assess the impact of using different tree- 
based restricted collective communication schemes on the 
overall performance of PSellnv, and compare the strong 
scaling of PSellnv using either (1) Flat-Tree, (2) Binary- 
Tree and (3) Shifted Binary-Tree communication patterns. 

The new implementation of PSellnv contains additional code 
improvements that do not fall into the scope of this paper. 
Therefore, to emphasize the impact of the different imple¬ 
mentations of restricted collective communication only, we 
use the wallclock timing measurements of the new PSellnv 
using the Flat-Tree approach as the baseline for compari¬ 
son. We also provide timings from our previous vO.7.3 re¬ 
lease of PSellnv for reference. This implementation also 
uses a Flat-Tree communication pattern. The wallclock 
time of the LU factorization based on SuperLU_DIST is 
also provided. This is a pre-processing step of PSellnv. The 
SuperLU_DIST timing results are also used here as a reference 
for evaluating the strong scaling PSellnv. 


Figure 7: Communication volume heat map of Row-Reduce 

The Row-Reduce operation can be seen as the reverse oper¬ 
ation of a broadcast. In this case, it is the amount of data 
received by each processor that we are concerned with. Heat 
maps corresponding to Flat-Tree and Shifted Binary-Tree 
are shown in Figures |7(a)| and |7(b)| respectively. We can 
clearly see that the Shifted Binary-Tree scheme results in 
a more balanced communication load distribution among all 


Every data point generated from the strong scaling experi¬ 
ments presented in this section corresponds to the average of 
6 runs. We report standard deviations by using error bars. 
We dehne the speedup factor and other ratios as the ratio 
between average values. 


Results for the DG_PNF14000 matrix are depicted in Fig¬ 
ure 
to t 


8 (a) We observe that switching from the Flat-Tree 


le Binary-Tree scheme leads to a reduction of the wall 
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Figure 8: Running times of PSellnv for two sample matrices 


clock time by a factor of 1.7 on average. The reduction fac¬ 
tor is larger when a larger number of processors are used 
in PSellnv. In particular, when more than 1,024 processors 
are used, the average speedup factor is 2.5. The speedup 
factor reaches 3.6 when the computation is performed on 
6,400 processors. Additional performance improvement can 
be seen when Shifted Binary-Tree scheme is used. In par¬ 
ticular, the average speedup factor is increased to 2.0. When 
more than 1,024 processors are used, the average speedup 
factor increases to 3.2. The maximum speedup reaches 5.Ox 
when 6,400 processors are used. 


l l \l Computation 


I Communication 


I Communication 



(a) Flat-Tree 


(b) Shifted Binary-Tree 


Switching from Flat-Tree to Binary-Tree also reduces the 
standard deviation of the wall clock time among multiple 
runs of the same code on the same input by a factor of 1.35 
on average. The reduction factor is 1.26 when the Shifted 
Binary-Tree is used. Compared to vO.7.3 of PSellnv pre¬ 
sented in [17| , the average reduction in standard deviation 
resulting from the use of Shifted Binary-Tree is more than 
3.19. 


Figure 9: Computation and communication times for 
DC_PNF14000 


tation is carried out on 256 and 4,096 processors respectively. 
The communication to computation ratio is reduced from 
11.8 to 1.9 when we switch from a Flat-Tree based commu¬ 
nication scheme a Shifted Binary-Tree based scheme. 


Similar observations holds for the audikw_l matrix. The 
strong scaling plot depicted in Figure [8(b) [ demonstrates the 
use of Binary-Tree and Shifted Binary-Tree allows us to 
scale the computation to 6,400 processors, whereas the seal- 
ability of Flat-Tree based PSellnv calculations is limited to 
less than 1,024 processors. The standard deviation in run¬ 
ning time is reduced by more than a factor of 4 when a large 
number of processors are used to run the same program with 
the same input multiple times. 


The improved scalability of PSellnv clearly results from 
a better implementation of restricted collective communi¬ 
cation which significant reduces communication overhead. 
This can also be seen from the ratio of computation and 
communication time. In Figures 9(a) and |9(b)[ we plot 
both the communication and computation time consumed 
by PSellnv for the DC_PNF14000 matrix when the compu- 


It is interesting to note that in both test cases, the benefit of 
using Shifted Binary-Tree is not so pronounced when the 
PSellnv is carried out among a small set of processors (e.g. 
256). This is due to the fact that in this case, several re¬ 
stricted collective communications take place within a single 
node of Edison, which has 24 cores. Because message passing 
is implemented as memory copies within shared memory on 
a single node, its cost is generally lower compared to intern¬ 
ode communication. Moreover, having a single send buffer 
in a Flat-Tree based scheme could enhance cache reuse and 
reduces the impact of issuing p messages compared to the 
logjP messages sent by a binary tree based collective com¬ 
munication scheme. Therefore, in practice, one can poten¬ 
tially use a “hybrid” scheme in which a Flat-Tree based 
collective communication is used when the communication 
is restricted to a relatively small number of processors and 
a Shifted Binary-Tree based scheme is used when a large 



























number of processors are involved. 

5. CONCLUSION AND FUTURE WORK 

We described several implementations of restricted collective 
communication in a parallel selected inversion algorithm. 
Each implementation uses the point-to-point MPI_Isend and 
MPI_Irecv functions available in a standard MPI library. 
However, they differ in the way each message is moved from 
one processor to another. We showed that a binary tree 
based data propagation scheme is far superior than a fiat 
tree based scheme when a large number of processors are 
involved in the collective communication. In particular, the 
binary-tree based scheme minimizes communication load im¬ 
balance and removes communication “hot spots”. 

The use of MPI_Isend and MPI_Irecv allows multiple collec¬ 
tive communications to be initiated at the same time. This 
is a desired feature that would allow us to exploit a higher 
level of concurrency in the PSellnv algorithm. In order to 
prevent a processor from becoming an internal node of mul¬ 
tiple binary trees, we developed a heuristic that involves 
applying a random circular shift to the list of receiving pro¬ 
cessor ranks. We demonstrated that such a heuristic leads to 
a significant improvement in the scalability of PSellnv. For 
instance, when 6,400 processors are used, we observe over 
5x speedup for test matrices. It also reduces the variation 
in running time when the same program is executed multi¬ 
ple times with the same input. Such variation is caused by 
inhomogeneous network architecture. Reducing this type of 
variation is extremely important for achieving scalable per¬ 
formance of the PEXSI algorithm in which multiple 

selected inversions are carried out simultaneously on differ¬ 
ent subgroups of processors. Although our implementation 
in this work is for symmetric matrices, the same commu¬ 
nication strategy can be naturally extended to asymmetric 
matrices, which is our work in progress. 
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